Q-raKtion: A Semiautomated KNIME Workflow for Bioactivity Data Points Curation

The recent increase in bioactivity data freely available to the scientific community and stored as activity data points in chemogenomic repositories provides a huge amount of ready-to-use information to support the development of predictive models. However, the benefits of such a vast amount of accessible information are strongly counteracted by the lack of uniformity and consistency of data from multiple sources, which requires a process of integration and harmonization. While different automated pipelines for processing and assessing chemical data have emerged in recent years, the curation of bioactivity data points is a less investigated topic, with useful concepts provided but no tangible tools available. In this context, the present work represents a first step toward filling this gap, by providing a tool that meets the needs of end users in building proprietary high-quality data sets for further studies. Specifically, we herein describe Q-raKtion, a systematic, semiautomated, flexible, and, above all, customizable KNIME workflow that effectively aggregates information on the biological activities of compounds retrieved from two of the most comprehensive and widely used repositories, PubChem and ChEMBL.

By using external files. Alternatively, the list of assays can be split externally into CSV files containing the lists of target-based and cell-based assays. These two files can then be loaded into the workflow (see figure below).

Finally, the datapoints are split according to the assigned ontology class (Target- and Cell-based associated activity datapoints metanodes).

Step 3: Activity data curation
Firstly, a list of non-redundant activity types (e.g. IC50, Ki, Kd, etc.) is derived by using the 'Standard Type' and 'acname' properties for ChEMBL and PubChem, respectively (metanode Activity type list). Secondly, we have developed a protocol to attach a quality label to each biological activity associated with a compound. Indeed, it is well known that the performance of a predictive model strongly depends on the quality of the training data. The application of this quality control protocol results in the assignment of a confidence class that indicates the quality of the activity measure and its consistency with respect to other available data.
A numerical priority value (Activity-type priority assessment component) can be assigned by the user to discriminate highly informative from poorly informative activity types. By default, a high priority (priority ≤ 3) is assigned to XC50 (i.e. IC50, EC50 and GI50) and KX (i.e. Ki and Kd) measurements, while a lower priority (priority > 3) is assigned to less precise (e.g. % inhibition, % enzyme control activity) or misleading (e.g. Activity, Inhibition, NULL) activity types. The datapoints are split based on the assigned activity priority (Rule-based Row Splitter node), so that high and low informative activity types are processed separately.
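The priority-based split can be illustrated with a minimal Python sketch. It is not part of the KNIME workflow: the individual priority numbers, the `activity_type` key and the function name are hypothetical, and only the threshold (priority ≤ 3 versus > 3) is taken from the text.

```python
# Hypothetical priority table: only the "<= 3 is high priority" threshold
# comes from the text; the individual numbers are illustrative.
DEFAULT_PRIORITY = {
    "IC50": 1, "EC50": 1, "GI50": 1,   # XC50 measurements
    "Ki": 2, "Kd": 2,                  # KX measurements
    "% inhibition": 4,                 # less precise
    "Activity": 5, "Inhibition": 5,    # misleading
}
HIGH_PRIORITY_MAX = 3


def split_by_priority(datapoints, priority=DEFAULT_PRIORITY, fallback=5):
    """Split rows into (high, low) informative lists by activity-type priority.

    Activity types missing from the table (including None) receive the
    low-priority `fallback` value, which is an assumption of this sketch.
    """
    high, low = [], []
    for row in datapoints:
        p = priority.get(row["activity_type"], fallback)
        (high if p <= HIGH_PRIORITY_MAX else low).append(row)
    return high, low
```

In the workflow itself the priorities are edited interactively, so a table like `DEFAULT_PRIORITY` stands in for the user's choices.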
The XC50 and KX values are then converted into the corresponding pXC50 and pKX values (-log10 of the original measure; pValue determination metanode) to allow datapoints to be compared regardless of the unit in which they were reported (e.g. µM or nM). This step, coupled with the previous classification (and datapoint splitting) into target-based and cell-based assays, ensures that all the data provided as input to the next steps (e.g. to the metanode Confidence Class Assigner) can be compared in terms of assay ontology and activity pValue.
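As a worked illustration of the pValue conversion, the following sketch assumes the usual convention that the measure is first expressed in molar units before taking -log10. The unit table and function name are hypothetical and not taken from the workflow.

```python
import math

# Assumed unit labels and their conversion factors to molar (M).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def p_value(value, unit):
    """Return -log10 of an activity value after conversion to molar units."""
    if value <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])
```

With this convention, 1 µM and 1000 nM map to the same pValue of about 6, which is what makes datapoints reported in different units directly comparable.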
At this point, a list of non-redundant compounds is generated and, for each of them, all the available datapoints are analyzed. We have developed the metanode Confidence Class Assigner to automatically determine the quality of the analyzed datapoint. Specifically, based on the activity type (e.g. IC50, Ki, Kd, etc., collected in the "Standard Type" and "acname" columns for ChEMBL and PubChem, respectively) and the activity qualifier (i.e. "=", ">", "≥", "<" or "≤", collected in the "Standard relation" and "acvalue" columns for ChEMBL and PubChem, respectively), the metanode automatically assigns a confidence class ranging from A to D.
When more than one value is available for the same activity type (e.g. IC50-1, IC50-2) for a given compound, only the best activity value is retained, and the data consistency is assessed by calculating the difference between the maximum and minimum corresponding pValues ("ΔpValue" property). The activity type, the activity qualifier (i.e. "=", ">", "≥", "<" or "≤") and the "ΔpValue" property are then used to assign a datapoint confidence class. Each row that the Confidence Class Assigner metanode provides as output refers to a single activity value for a single activity type of a given compound, coupled with the proper confidence class.
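The selection of the best value and the ΔpValue calculation can be sketched as follows. This is an illustration only: it assumes that "best" means the highest pValue, uses a hypothetical tuple layout for the datapoints, and does not reproduce the confidence class assignment itself, whose full rule set is not given here.

```python
from collections import defaultdict


def summarise_replicates(datapoints):
    """Group (compound, activity_type, pvalue) tuples and summarise each group.

    Returns a dict keyed by (compound, activity_type) holding the best
    (assumed: highest) pValue, the max-min spread ("delta_pvalue") and the
    number of replicates.
    """
    groups = defaultdict(list)
    for compound, activity_type, pvalue in datapoints:
        groups[(compound, activity_type)].append(pvalue)
    return {
        key: {
            "best_pvalue": max(values),
            "delta_pvalue": max(values) - min(values),
            "n": len(values),
        }
        for key, values in groups.items()
    }
```

A compound with a single measurement for an activity type gets a ΔpValue of 0 under this scheme.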
The multiple annotated activities corresponding to the same compound are then aggregated into a single row (Per-compound Activities Concatenation metanode/Concatenate node/GroupBy node). The outputs of these operations are four datasheets, two each for PubChem and ChEMBL, reporting the target-based and cell-based associated activity datapoints. In these tables, each row contains a unique molecule for which the best activity datapoint of each activity type is reported and flagged with the corresponding confidence class.
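A rough picture of the resulting table layout, one row per compound with one value and one class column per activity type, is given by the sketch below. The record keys and column naming scheme are hypothetical; the workflow uses KNIME Concatenate and GroupBy nodes for this step.

```python
def one_row_per_compound(records):
    """Pivot per-activity records into one dict (row) per compound.

    Each input record is assumed to carry the keys "compound",
    "activity_type", "best_pvalue" and "confidence_class".
    """
    table = {}
    for rec in records:
        row = table.setdefault(rec["compound"], {"compound": rec["compound"]})
        row[f'{rec["activity_type"]}_pvalue'] = rec["best_pvalue"]
        row[f'{rec["activity_type"]}_class'] = rec["confidence_class"]
    return list(table.values())
```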

Step 4: Data integration
First, two separate and comprehensive datasets are created for ChEMBL and PubChem (Curated Dataset (ChEMBL) and Curated Dataset (PubChem) metanodes). To this end, the target-based and cell-based activity datapoints associated with the same compound identifier (as found in the "ChEMBL ID" and "PubChem CID" columns for ChEMBL and PubChem, respectively) are combined into a single row.
These two datasets are finally merged into a single final dataset. However, given the absence of a common identifier, the workflow exploits the PubChem Identifier Exchange Service (available free of charge at https://pubchem.ncbi.nlm.nih.gov/idexchange/idexchange.cgi) to allow the user to retrieve the corresponding PubChem CID identifiers for ChEMBL chemical structures. Specifically, the workflow generates a file with the SMILES strings of the ChEMBL compounds (CSV writer node). This file can be uploaded to the above-mentioned webpage, selecting "SMILES" as the Input ID list option. The search returns a table containing the input SMILES and the corresponding CIDs, which can be loaded into the workflow using the Decompress Files node, thus allowing the assignment of the proper PubChem identifier (i.e. CID) to each ChEMBL compound. The PubChem Identifier Exchange Service can return a result in which some compounds are missing. The CIDs of the missing compounds can be searched manually and provided to the workflow using the second File reader - missing CIDs node. In the case of the workflow application to the AKT1 protein, the ChEMBL IDs of the missing compounds were used as the query in the PubChem Identifier Exchange Service instead of the SMILES strings. The service provided a list of synonyms for each ChEMBL ID, among which the PubChem CID was reported. Starting from this output, a file reporting the ChEMBL ID and the corresponding CID was created and uploaded.
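The identifier assignment can be pictured with the short sketch below. It assumes that the exchange result is a two-column, tab-separated text (input SMILES, then CID) and that compounds without a returned CID are set aside for manual lookup. The row keys and function name are hypothetical.

```python
def assign_cids(chembl_rows, exchange_text):
    """Attach PubChem CIDs to ChEMBL rows using a SMILES -> CID table.

    `exchange_text` is assumed to hold one "SMILES<TAB>CID" pair per line;
    lines with an empty CID are treated as unresolved.
    Returns (matched_rows, missing_rows).
    """
    mapping = {}
    for line in exchange_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1].strip():
            mapping[fields[0].strip()] = fields[1].strip()

    matched, missing = [], []
    for row in chembl_rows:
        cid = mapping.get(row["smiles"])
        if cid is None:
            missing.append(row)  # to be resolved manually
        else:
            matched.append({**row, "cid": cid})
    return matched, missing
```

The `missing` list corresponds to the compounds whose CIDs have to be supplied through the second file reader. Note that matching on exact SMILES strings only works if the strings uploaded and returned are identical, which is an assumption of this sketch.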
In the final step, the comprehensive datasets of ChEMBL and PubChem are merged, and, for compounds having the same CID, the datapoints are collected in a unique row (ChEMBL & PubChem Merged Tables metanode).
In the merging process, only the XC50 (e.g. IC50, EC50 and GI50) and KX (e.g. Ki and Kd) measurements are considered, as these are highly informative activity types shared by both databases (ChEMBL & PubChem Curated Summary Table metanode). When two different activity values of the same type (e.g. IC50) are available from both ChEMBL and PubChem, only the best datapoint is retained and the confidence class is updated by applying the following rules in a stepwise manner:
- When both activities are in the A confidence class and the "ΔpValue" is ≤ 0.5, the highest value is retained, and the A flag is assigned.
- When both activities are in the A confidence class and the "ΔpValue" is > 0.5, the highest value is retained, and the B flag is assigned.
- When one of the two activities is in the D confidence class, the highest value is retained, and the D flag is assigned.
- When one of the two activities is in the B confidence class, the highest value is retained, and the B flag is assigned.
- When a compound is flagged as "inactive" in the "activity" property of the original PubChem datasheet and no additional activities are reported in ChEMBL or PubChem, the compound is flagged as truly inactive and assigned to the A confidence class.
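The stepwise rules for two datapoints of the same activity type can be sketched as a small function. It is only an illustration: it assumes the ΔpValue is the absolute difference between the ChEMBL and PubChem pValues, treats a ΔpValue of exactly 0.5 as consistent, returns `None` for class combinations that the listed rules do not cover, and leaves out the "inactive" rule, which applies to compounds without paired activities.

```python
def merge_datapoints(p_chembl, class_chembl, p_pubchem, class_pubchem,
                     threshold=0.5):
    """Return (retained_pvalue, merged_class) for one compound/activity type."""
    best = max(p_chembl, p_pubchem)          # the highest value is retained
    delta = abs(p_chembl - p_pubchem)        # assumed definition of ΔpValue
    classes = {class_chembl, class_pubchem}

    if classes == {"A"}:                     # both datapoints in class A
        merged = "A" if delta <= threshold else "B"
    elif "D" in classes:                     # checked before B (stepwise order)
        merged = "D"
    elif "B" in classes:
        merged = "B"
    else:
        merged = None                        # not covered by the listed rules
    return best, merged
```

For example, two class A values with pValues 7.0 and 7.3 are merged as `(7.3, "A")`, whereas a B/D pair falls under the D rule because it is evaluated first.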

Collection of AKT1 activity datapoints from ChEMBL and PubChem Data
The UniProt ID P31749, corresponding to the AKT1 protein, was used to query the web interfaces of ChEMBL and PubChem databases (access date December 2021).
From ChEMBL, 8795 datapoints were downloaded by selecting the single-protein target ID CHEMBL4282 (https://www.ebi.ac.uk/chembl/g/#browse/activities/filter/target_chembl_id%3ACHEMBL4282). From this webpage, all the available activities were selected and downloaded as a CSV file.
Within PubChem, the search produced two results, one associated with the AKT1 gene page and one with the AKT1 protein page. Only the results from the protein page were used for this work. From the PubChem webpage of the AKT1 protein (RAC-alpha serine/threonine-protein kinase, https://pubchem.ncbi.nlm.nih.gov/protein/P31749#section=Chemicals-and-Bioactivities), we selected and downloaded all the tested compounds in the Chemicals and Bioactivities section as a CSV file. In total, 366895 bioactivity datapoints were downloaded.

Figure S1. Detailed view of the Q-raKtion organization. The input tables downloaded from ChEMBL and PubChem are loaded into the workflow ("Input data loading": red rows, left part), and through automated and manually assisted (blue box) data curation steps, Q-raKtion returns the final dataset of compounds with the corresponding annotated activities (red row, right part).

Figure S2. Example of application to the AKT1 protein. The "BAO label editing" and "Assay Ontology Curation" components provide the interactive tables for the manual curation of the bioassay ontology illustrated in panels A and B, respectively. The Ontology Class column is editable, so that the proper bioassay ontology class can be assigned manually based on the "BAO label" column or the assay description property ("aidname" column for PubChem, "Assay Description" column for ChEMBL).

Figure S3. Example of application to the AKT1 protein. The "Activity-type priority assessment" component provides the illustrated interactive table for the ChEMBL datapoints corresponding to AKT1 activities. The Activity Priority column is editable and can be used i) to discriminate highly informative from poorly informative activity types and ii) to aggregate identical activity types (e.g. "% Control" and "% Ctrl").
In the preparation of AKT1 and AKT2 human AKT1 and AKT2 to which a middle T antigen tag was added were expressed in Sf 9 insect cells and then AKT1 and AKT2 were prepared following affinity purification and activation by PDK1. The prepared AKT1 and AKT2 were stored at -80C. until the time of measurement of inhibitory activity of the compounds. In the measurement of inhibitory activity of the compounds AKT1 or AKT2 and each compound of the present invention were preincubated at 25C. for 120 minutes in a buffer solution for reaction (15 mM Tris-HCl pH 7.5 0.01% Tween-20 2 mM DTT). As a substrate biotinylated Crosstide (bioton-KGSGSGRPRTSSFAEG) MgCl2 and ATP were added to final concentrations.

Supporting Figures and Tables
Target-based assay 64 cell-based format CHEMBL3705788 TR-FRET Assay: Akt1 inhibitory activity of compounds of the present invention may be quantified employing the Akt1 TR-FRET assay as described in the following paragraphs. His-tagged human recombinant kinase full-length Akt1 expressed in insect cells was purchased form Invitrogen (part number PV 3599). As substrate for the kinase reaction the biotinylated peptide biotin-Ahx-KKLNRTLSFAEPG (C-terminus in amide form) was used which can be purchased e.g. from the company Biosynthan GmbH (Berlin-Buch Germany). For the assay 50 nl of a 100 fold concentrated solution of the test compound in DMSO was pipetted into a black low volume 384 well microtiter plate (Greiner Bio-One Frickenhausen Germany) 2 ul of a solution of Akt1 in assay buffer [50 mM TRIS/HCl pH 7.5 5 mM MgCl2 1 mM dithiothreitol 0.02% (v/v) Triton X-100 (Sigma)] were added and the mixture was incubated for 15 min at 22C to allow pre-binding of the test compounds to the enzyme before the start of the kinase reaction.
Target-based assay 93 cell-based format CHEMBL3706239 TR-FRET assay: Akt1 inhibitory activity of compounds of the present invention was quantified employing the Akt1 TR-FRET assay as described in the following paragraphs. His-tagged human recombinant kinase full-length Akt1 expressed in insect cells was purchased form Invitrogen (part number PV 3599). As substrate for the kinase reaction the biotinylated peptide biotin-Ahx-KKLNRTLSFAEPG (C-terminus in amide form) was used which can be purchased e.g. from the company Biosynthan GmbH (Berlin-Buch Germany). For the assay 50 nl of a 100 fold concentrated solution of the test compound in DMSO was pipetted into a black low volume 384 well microtiter plate (Greiner Bio-One Frickenhausen Germany) 2 ul of a solution of Akt1 in assay buffer [50 mM TRIS/HCl pH 7.5 5 mM MgCl2 1 mM dithiothreitol 0.02% (v/v) Triton X-100 (Sigma)] were added and the mixture was incubated for 15 min at 22 C. to allow pre-binding of the test compounds to the enzyme before the start of the kinase reaction.
Target-based assay 38 cell-based format